Resources

  • These slides are based heavily on portions of two books:
    • R for Data Science (Beginner to Intermediate)
      • Hadley Wickham and Garrett Grolemund; 1st edition, revised 2020
    • Advanced R (…advanced, but a fantastic book)
      • Hadley Wickham; 2nd edition, 2019

Hadley Appreciation Slide

  • Hadley Wickham is “Chief Scientist” at RStudio
  • Author of many tidyverse core packages
  • Has good talks on YouTube

First Some Context

  • There is a deeper philosophy underlying the concepts of tidy data and visualization
  • United by the concept of functional programming (FP)
    • I want to give context before we dive into ggplot and dplyr
    • Supplemental reading in Advanced R and package docs for purrr

FP Building Block: Pure Functions

  • Focus on pure functions
    • a.k.a. side-effect free, same input same output
    • Functions should be small. They should do one and exactly one thing.
  • This function is pure:
pure_function <- function(x) x + 3
  • This function is not pure because it relies on a global variable. I can change y and change the behavior of the function.
y <- 3
impure_function <- function(x) x + y
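
A quick way to see the difference, as a minimal sketch that reuses the two definitions above: call each function, reassign y, and call both again. The variable names for the results are mine, chosen only for illustration.

```r
# Same definitions as on the slide
pure_function <- function(x) x + 3
y <- 3
impure_function <- function(x) x + y

pure_before   <- pure_function(1)    # depends only on its argument
impure_before <- impure_function(1)  # looks up y outside the function

y <- 10                              # change the variable the impure function relies on

pure_after   <- pure_function(1)     # same input, same output
impure_after <- impure_function(1)   # same input, different output
```

Here pure_before and pure_after are both 4, while impure_function(1) goes from 4 to 11 even though its argument never changed.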

Advanced R, quote about Functional Programming (emphasis added)

It’s hard to describe exactly what a functional style is, but generally I think it means decomposing a big problem into smaller pieces, then solving each piece with a function or combination of functions. When using a functional style, you strive to decompose components of the problem into isolated functions that operate independently. Each function taken by itself is simple and straightforward to understand; complexity is handled by composing functions in various ways.

What’s the point of FP?

  • Focus on function composition allows you to separate the “pipeline” from the “data”
  • Let’s compute \(f(x) = \exp\left(\sqrt{x^2 + 3}\right)\). The “data” is \(x\)—the “pipeline” is the set of functions that we apply
library(magrittr) # provides %>% and the `. %>% f` shorthand
square <- function(x) x^2
add3 <- function(x) x + 3
f <- . %>% square %>% add3 %>% sqrt %>% exp
f(2)
## [1] 14.09403
  • This idea of function pipelining captures the essence of functional programming

The Pipe Operator

  • The pipe operator from magrittr is a key part of our FP toolkit
    • x %>% f and x %>% f() get translated to f(x)
    • x %>% f(y) gets translated to f(x,y)
    • . %>% f %>% g gets translated to function(x) g(f(x))
  • As academics who produce reports, we can think of R scripts as pipelines that operate on our saved data files and produce some output: a figure, another dataset, an R markdown file…
    • In a sense, our R scripts should look like one long string of pipes! We take our data and pipe it through a series of small, pure functions and wind up with our output.
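
The three translation rules can be checked directly. This is a small sketch assuming only that magrittr is installed; the vector x and the name sqrt_then_sum are made up for the example.

```r
library(magrittr)

x <- c(1, 4, 9)

# x %>% f  is the same as  f(x)
identical(x %>% sqrt, sqrt(x))            # TRUE

# x %>% f(y)  is the same as  f(x, y)
identical(x %>% round(1), round(x, 1))    # TRUE

# . %>% f %>% g  builds  function(x) g(f(x))
sqrt_then_sum <- . %>% sqrt %>% sum
identical(sqrt_then_sum(x), sum(sqrt(x))) # TRUE
```

The last form is the one used to define f earlier: the pipeline exists as an object of its own before any data is supplied.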

If you think this is cool

  • Read the section of Advanced R on functional programming
  • Listen to Hadley’s keynote on Functional Programming on YouTube
  • Learn the Haskell Programming Language

ggplot2

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey


The gg stands for “grammar of graphics”

  • Map columns of your dataset to features of a graph. Here’s a template:
ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)) +
  [scale adjustments, labels, legend, theme, etc...]

References

This section is going to build plots using the gapminder dataset (world population, life expectancy, and GDP/capita over time)

Gapminder dataset

  • Here is a snippet:
library(tidyverse) # dplyr, tidyr, ggplot2, and the %>% pipe
library(gapminder)
set.seed(12345) # set pseudorandom number generator's seed for reproducibility
gapminder %>% sample_n(10)
## # A tibble: 10 × 6
##    country          continent  year lifeExp       pop gdpPercap
##    <fct>            <fct>     <int>   <dbl>     <int>     <dbl>
##  1 Bolivia          Americas   1997    62.0   7693188     3326.
##  2 Argentina        Americas   1962    65.1  21283783     7133.
##  3 Indonesia        Asia       2007    70.6 223547000     3541.
##  4 Iran             Asia       1997    68.0  63327987     8264.
##  5 Portugal         Europe     1987    74.1   9915289    13039.
##  6 Hong Kong, China Asia       1967    70     3722800     6198.
##  7 Kenya            Africa     1997    54.4  28263827     1360.
##  8 Guatemala        Americas   1972    53.7   5149581     4031.
##  9 Ghana            Africa     2002    58.5  20550751     1112.
## 10 Slovak Republic  Europe     1987    71.1   5199318    12037.

Basic Scatterplot

  • Let’s try to visualize the relationship between GDP per capita and life expectancy
    • Take 1: scatterplot
    • aes controls the “aesthetic mapping” between columns of your dataset and parts of the graph
ggplot(gapminder) +
  geom_point(aes(x = gdpPercap, y = lifeExp))

Basic scatterplot


Slightly Improved Scatterplot

  • This is a panel dataset, so we probably don’t want to display data for all years
    • In grammar of graphics, the year is another “dimension” we could map to an aesthetic feature
    • For now, let’s just filter the data and choose one year
  • Coding footnotes
    • Slightly annoyingly, we use %>% for the data pipeline and + to combine ggplot components
    • Also note that the aes call can either go in ggplot or in the individual geom functions. I prefer the former.
gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp)) +
    geom_point()

Scatterplot with only one year
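
To illustrate the footnote about where aes can go, here is a sketch of both placements, assuming dplyr, ggplot2, and gapminder are installed. The object names gap_2002, p_global, and p_local are mine.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

gap_2002 <- gapminder %>% filter(year == 2002)

# aes() inside ggplot(): inherited by every geom added afterwards
p_global <- ggplot(gap_2002, aes(gdpPercap, lifeExp)) +
  geom_point()

# aes() inside the geom: applies to that layer only
p_local <- ggplot(gap_2002) +
  geom_point(aes(gdpPercap, lifeExp))
```

Both objects draw the same scatterplot. The placement only starts to matter once a plot has several layers that should not all share the same mapping.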


You can include multiple geometries

  • To illustrate, you can include more than one geometry in a plot
    • This plot adds a second geometry, geom_smooth, which draws a smoothed trend line (a loess fit by default, as the message under the code shows)
  • Coding footnotes
    • The first two arguments to aes are always x and y, so you don’t need to put x = to specify the argument.
gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp)) +
    geom_point() +
    geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Plot with two geometries
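
If you want a straight best-fit line rather than the default loess smoother, geom_smooth takes a method argument. A minimal sketch, assuming the same packages as the rest of this section; the name p_lm is only for illustration.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

p_lm <- gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp)) +
    geom_point() +
    # method = "lm" fits an ordinary least squares line;
    # stating the formula explicitly also silences the message
    geom_smooth(method = "lm", formula = y ~ x)
```

On the untransformed x axis a straight line is a poor description of these data, which is part of the motivation for the log scale introduced later.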


Let’s add some color

  • Looking in the documentation for geom_point, we see that there are 9 different aesthetic mappings we could use: color (outline) or fill of the point, its size, its transparency (alpha), etc.
  • Let’s map fill (aka color) onto the continent and size to the population
gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp, fill = continent, size = pop)) +
    geom_point(pch = 21, color = "black")

Mapping more aesthetics


  • You can see that aesthetics are treated differently depending on whether they are categorical (like continent) or continuous (like population)
  • ggplot tries to pick sensible defaults
    • Which colors to use for which continents
    • What size to use for what population
    • Axis ticks for the x and y axes
  • We can use “scale adjustments” to adjust how these aesthetics are mapped
    • Template: scale_<AES_TYPE>_<SCALE_TYPE>(...)
    • Example: I want 5 axis breaks on my x axis:
[...] + scale_x_continuous(n.breaks = 5)
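
To make the scale_<AES_TYPE>_<SCALE_TYPE> template concrete, here is a sketch with one scale adjustment per mapped aesthetic. The choice of scale_fill_brewer and scale_size_area (and their arguments) is mine, picked only to show the naming pattern.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

p_scales <- gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp, fill = continent, size = pop)) +
    geom_point(pch = 21, color = "black") +
    scale_x_continuous(n.breaks = 5) +     # x aesthetic, continuous scale
    scale_fill_brewer(palette = "Set2") +  # fill aesthetic, ColorBrewer palette
    scale_size_area(max_size = 12)         # size aesthetic, point area proportional to pop
```

Each call overrides the default that ggplot would otherwise pick for that aesthetic and leaves the others alone.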

Scale transformations, illustrated

  • Let’s use a log x scale. There’s a function for that, scale_x_log10
  • Let’s also use a size scale transformation to make the points bigger
gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp, fill = continent, size = pop)) +
    geom_point(pch = 21, color = "black") +
    scale_x_log10() +
    scale_size_continuous(range = c(2, 18))

Log x scale, bigger points


Faceting

  • Another way we can display information in ggplot is by faceting, i.e. one subplot for each group in a categorical column
    • Can control the number of rows/columns in the output via the nrow/ncol arguments
    • Can control whether all subplots have the same x/y axis via scales argument
gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp, fill = continent, size = pop)) +
    geom_point(pch = 21, color = "black") +
    scale_x_log10() +
    scale_size_continuous(range = c(2, 18)) +
    facet_wrap(~continent)

Choosing fixed or free scales is important when faceting
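
A sketch of the two facet_wrap arguments mentioned above, assuming the same packages as before. The values ncol = 2 and scales = "free" are my choices for illustration, and p_free is a made-up name.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

p_free <- gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp, fill = continent, size = pop)) +
    geom_point(pch = 21, color = "black") +
    scale_x_log10() +
    # ncol sets the grid layout; scales = "free" gives every panel its own
    # x and y ranges ("free_x" or "free_y" frees only one axis)
    facet_wrap(~continent, ncol = 2, scales = "free")
```

Free scales make each panel easier to read on its own, but continents can no longer be compared by eye across panels, so the default fixed scales are usually the safer choice for this particular plot.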


Labels and Legends

  • Use the labs function to change axis labels, add titles, change legend labels, etc.
  • Use the guides function to control what aesthetics get their own legends
gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp, fill = continent, size = pop)) +
    geom_point(pch = 21, color = "black") +
    scale_x_log10() +
    scale_size_continuous(range = c(2, 18)) +
    labs(x = "GDP/cap", y = "Life Expectancy", 
         title = "This graph is looking nice now", fill = "Region") +
    guides(size = "none")

Controlling labels and legends


Themes

  • There are built-in theme functions that start with theme_XXX—the ggthemes package provides even more. Some personal favorites are theme_minimal, theme_fivethirtyeight, theme_calc
gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp, fill = continent, size = pop)) +
    geom_point(pch = 21, color = "black") +
    scale_x_log10() +
    scale_size_continuous(range = c(2, 18)) +
    labs(x = "GDP/cap", y = "Life Expectancy", 
         title = "This graph is looking nice now", fill = "Region") +
    guides(size = "none") +
    theme_fivethirtyeight()

Experimenting with themes


Caveat on themes

  • Whoops! It looks like the fivethirtyeight theme doesn’t include axis labels
  • Themes may override the “display style” of different elements
    • You may have to tinker with calls to theme() and plenty of Googling to fix these issues…
    • Just a heads up!
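
One common workaround for the missing axis labels, sketched here on the assumption that ggthemes is installed: add a theme() call after theme_fivethirtyeight() that switches the axis titles back on. The name p_labeled is only for illustration.

```r
library(dplyr)
library(ggplot2)
library(ggthemes)
library(gapminder)

p_labeled <- gapminder %>%
  filter(year == 2002) %>%
  ggplot(aes(gdpPercap, lifeExp, fill = continent, size = pop)) +
    geom_point(pch = 21, color = "black") +
    scale_x_log10() +
    labs(x = "GDP/cap", y = "Life Expectancy", fill = "Region") +
    guides(size = "none") +
    theme_fivethirtyeight() +
    # The theme blanks out axis titles; this must come AFTER the theme call
    theme(axis.title = element_text())
```

Order matters here: a complete theme such as theme_fivethirtyeight() replaces earlier theme settings, so tweaks go after it.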

Some random visualizations

  • For illustrative purposes, here are some other ways I came up with to visualize these data
  • First, population over time
    • Note that we use the group aesthetic in a line graph to make sure the right data points are connected
ggplot(gapminder, aes(year, pop, group = country, color = continent)) +
  geom_point() +
  geom_line() +
  facet_wrap(~continent)

Line plot, facets don't help that much


  • There is a lot going on with that graph
    • I think it would be better to just aggregate by continent for a series plot
    • This requires some aggregation using summarise (more to come on this)
gapminder %>%
  filter(continent %in% c("Africa", "Europe", "Americas")) %>%
  group_by(continent, year) %>%
  summarise(population = sum(pop)) %>%
  ggplot(aes(year, population, color = continent)) +
    geom_point() +
    geom_line() +
    scale_y_log10()
## `summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.

This plot conveys something interesting


What about GDP and life expectancy?

  • I want to aggregate by continent, then facet and show GDP on one panel and life expectancy on the other panel

  • This means that we need to pivot the data (more to come on this too)

  • Here we want to pivot_longer

plot_data <-
  gapminder %>%
    group_by(continent, year) %>%
    summarise(
      gdpPercap = weighted.mean(gdpPercap, pop),
      lifeExp = weighted.mean(lifeExp, pop)) %>%
    ungroup %>%
    pivot_longer(c(gdpPercap, lifeExp))

What did pivot_longer do?

  • Data before pivoting:
    continent  year  gdpPercap   lifeExp
    Africa     1952   1311.221  38.79973
    Africa     1957   1444.952  40.94031

  • Data after pivoting:
    continent  year  name            value
    Africa     1952  gdpPercap  1311.22144
    Africa     1952  lifeExp      38.79973
    Africa     1957  gdpPercap  1444.95199
    Africa     1957  lifeExp      40.94031
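
The same reshaping can be reproduced on a two-row table. This sketch types in the two Africa rows shown above by hand (rounded as displayed) and assumes dplyr, tidyr, and tibble are installed; the names wide, long, and back are mine.

```r
library(dplyr)
library(tidyr)
library(tibble)

wide <- tribble(
  ~continent, ~year, ~gdpPercap, ~lifeExp,
  "Africa",    1952,   1311.221, 38.79973,
  "Africa",    1957,   1444.952, 40.94031
)

# Column names go into "name", cell contents into "value" (the defaults)
long <- wide %>% pivot_longer(c(gdpPercap, lifeExp))

# pivot_wider undoes the operation
back <- long %>% pivot_wider(names_from = name, values_from = value)
```

Each wide row becomes two long rows, one per pivoted column, so long has four rows and back matches wide again.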

Now we have an aesthetic to map

  • Given we have a name variable now, we can facet on it
ggplot(plot_data, aes(year, value, color=continent)) +
  geom_point() +
  geom_line() +
  facet_wrap(~name, scales="free_y") +
  scale_y_log10()

Now prefer free y scale b/c GDP & lifeExp aren't comparable


  • Finally, I’ll use the gganimate package to make a gif
    • If you’ve seen any gapminder visualizations before, this is what you’ve seen:
plot <- gapminder %>%
  ggplot(aes(gdpPercap, lifeExp, fill = continent, size = pop)) +
    geom_point(pch=21, color="black") +
    transition_time(year) + # from gganimate
    scale_x_log10() +
    scale_size_continuous(range=c(2, 18)) +
    guides(size = "none") +
    theme_fivethirtyeight() +
    labs(
      title = "How have income and life expectancy changed?",
      subtitle = "{round(frame_time)}"
    )

anim_save("writing/rdemo_assets/animated.gif", plot)

Data Manipulation and Tidy Data

“Happy families are all alike; every unhappy family is unhappy in its own way.” — Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” — Hadley Wickham

Tidy Plots Require Tidy Data

What is Data Manipulation?

  • filter, arrange, select, mutate, summarise, join
  • Going to quickly cover each of the above as well as pivoting
  • Conclude with an example
  • There is a lot here; I strongly recommend going through the linked resources for a more thorough treatment

Filter

  • Reduce the number of rows in a dataset by keeping only the rows for which a predicate expression is true
    • Remember, filter expects a logical vector with one element per row of the dataset
# which countries had fewer than 1 million people in 1962?
gapminder %>% filter(pop < 1e6 & year == 1962) 
## # A tibble: 20 × 6
##    country               continent  year lifeExp    pop gdpPercap
##    <fct>                 <fct>     <int>   <dbl>  <int>     <dbl>
##  1 Bahrain               Asia       1962    56.9 171863    12753.
##  2 Botswana              Africa     1962    51.5 512764      984.
##  3 Comoros               Africa     1962    44.5 191689     1407.
##  4 Djibouti              Africa     1962    39.7  89898     3021.
##  5 Equatorial Guinea     Africa     1962    37.5 249220      583.
##  6 Gabon                 Africa     1962    40.5 455661     6631.
##  7 Gambia                Africa     1962    33.9 374020      600.
##  8 Guinea-Bissau         Africa     1962    34.5 627820      522.
##  9 Iceland               Europe     1962    73.7 182053    10350.
## 10 Jordan                Asia       1962    48.1 933559     2348.
## 11 Kuwait                Asia       1962    60.5 358266    95458.
## 12 Lesotho               Africa     1962    47.7 893143      412.
## 13 Mauritius             Africa     1962    60.2 701016     2529.
## 14 Montenegro            Europe     1962    63.7 474528     4650.
## 15 Namibia               Africa     1962    48.4 621392     3173.
## 16 Oman                  Asia       1962    43.2 628164     2925.
## 17 Reunion               Africa     1962    57.7 358900     3174.
## 18 Sao Tome and Principe Africa     1962    51.9  65345     1072.
## 19 Swaziland             Africa     1962    45.0 370006     1856.
## 20 Trinidad and Tobago   Americas   1962    64.9 887498     4998.
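
To make the "logical vector" point concrete, here is a sketch that builds the vector by hand and compares it with filter. It assumes dplyr and gapminder are installed; keep, by_filter, and by_index are illustrative names.

```r
library(dplyr)
library(gapminder)

# Comma-separated conditions in filter() are combined with AND
by_filter <- gapminder %>% filter(pop < 1e6, year == 1962)

# The same predicate as an explicit logical vector, one element per row
keep <- gapminder$pop < 1e6 & gapminder$year == 1962
by_index <- gapminder[keep, ]
```

Both approaches return the same 20 rows shown above; filter just lets you write the predicate using bare column names.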

Arrange

  • Re-order the rows of a dataset based on the values in a column
    • Ascending order by default
    • Use desc for descending order
# what were the five smallest countries in 1977?
gapminder %>% filter(year == 1977) %>% arrange(pop) %>% head(5)
## # A tibble: 5 × 6
##   country               continent  year lifeExp    pop gdpPercap
##   <fct>                 <fct>     <int>   <dbl>  <int>     <dbl>
## 1 Sao Tome and Principe Africa     1977    58.6  86796     1738.
## 2 Equatorial Guinea     Africa     1977    42.0 192675      959.
## 3 Iceland               Europe     1977    76.1 221823    19655.
## 4 Djibouti              Africa     1977    46.5 228694     3082.
## 5 Bahrain               Asia       1977    65.6 297410    19340.
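
The desc helper mentioned above flips the order. A minimal sketch under the same package assumptions, answering the opposite question; the name largest is mine.

```r
library(dplyr)
library(gapminder)

# what were the five largest countries in 1977?
largest <- gapminder %>%
  filter(year == 1977) %>%
  arrange(desc(pop)) %>%   # descending by population
  head(5)
```

arrange also accepts several columns, e.g. arrange(continent, desc(pop)), in which case later columns break ties in earlier ones.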

Select

  • Select a subset of the columns of your data frame
# what were the five smallest countries in 1977? just the countries
gapminder %>% filter(year == 1977) %>% arrange(pop) %>% head(5) %>% select(country)
## # A tibble: 5 × 1
##   country              
##   <fct>                
## 1 Sao Tome and Principe
## 2 Equatorial Guinea    
## 3 Iceland              
## 4 Djibouti             
## 5 Bahrain

# select numeric columns except for `pop`
gapminder %>% select(where(is.numeric), -pop)
## # A tibble: 1,704 × 3
##     year lifeExp gdpPercap
##    <int>   <dbl>     <dbl>
##  1  1952    28.8      779.
##  2  1957    30.3      821.
##  3  1962    32.0      853.
##  4  1967    34.0      836.
##  5  1972    36.1      740.
##  6  1977    38.4      786.
##  7  1982    39.9      978.
##  8  1987    40.8      852.
##  9  1992    41.7      649.
## 10  1997    41.8      635.
## # … with 1,694 more rows
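
A few other ways to pick columns, as a sketch under the same package assumptions. These particular selections are my own examples, not taken from the slides.

```r
library(dplyr)
library(gapminder)

# by name, plus a helper that matches a prefix
by_prefix <- gapminder %>% select(country, year, starts_with("gdp"))

# rename a column while selecting it
renamed <- gapminder %>% select(country, life_expectancy = lifeExp)

# drop columns instead of listing the ones to keep
dropped <- gapminder %>% select(-c(continent, pop))
```

All of these use the same "tidy select" mini-language as where(is.numeric) above, and the same helpers work inside across and pivot_longer.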

Mutate

  • Create new columns based on existing columns (or other R vectors in your workspace)
    • Each new column needs one value per row of your dataset (a single value is recycled to every row)
# What were the top 5 GDP countries in 2002?
gapminder %>% 
  filter(year == 2002) %>%
  mutate(totalGDP = pop * gdpPercap) %>% 
  arrange(desc(totalGDP)) %>%
  select(country, pop:totalGDP) %>%
  head(5)
## # A tibble: 5 × 4
##   country              pop gdpPercap totalGDP
##   <fct>              <int>     <dbl>    <dbl>
## 1 United States  287675526    39097.  1.12e13
## 2 China         1280400000     3119.  3.99e12
## 3 Japan          127065841    28605.  3.63e12
## 4 Germany         82350671    30036.  2.47e12
## 5 India         1034172547     1747.  1.81e12

# make log columns, perhaps for modeling. The {.col} is called "glue syntax"
gapminder %>%
  mutate(across(where(is.numeric), log, .names = "log_{.col}")) %>%
  head(5)
## # A tibble: 5 × 10
##   country  continent  year lifeExp    pop gdpPercap log_year log_lifeExp log_pop
##   <fct>    <fct>     <int>   <dbl>  <int>     <dbl>    <dbl>       <dbl>   <dbl>
## 1 Afghani… Asia       1952    28.8 8.43e6      779.     7.58        3.36    15.9
## 2 Afghani… Asia       1957    30.3 9.24e6      821.     7.58        3.41    16.0
## 3 Afghani… Asia       1962    32.0 1.03e7      853.     7.58        3.47    16.1
## 4 Afghani… Asia       1967    34.0 1.15e7      836.     7.58        3.53    16.3
## 5 Afghani… Asia       1972    36.1 1.31e7      740.     7.59        3.59    16.4
## # … with 1 more variable: log_gdpPercap <dbl>

Aggregate Data

  • summarise (or summarize) data by one or more groups
    • More on grouping in this article
      • N.B. group_by also affects the behavior of mutate (discussed in linked article)
# what continent had the highest GDP/cap in 1982?
gapminder %>%
  filter(year == 1982) %>%
  group_by(continent) %>%
  summarise(gdpPercap = weighted.mean(gdpPercap, pop)) %>%
  arrange(desc(gdpPercap))
## # A tibble: 5 × 2
##   continent gdpPercap
##   <fct>         <dbl>
## 1 Oceania      19155.
## 2 Europe       15782.
## 3 Americas     14411.
## 4 Asia          2458.
## 5 Africa        2295.
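
Two grouping details are worth a short sketch: group_by changes what mutate computes, and summarise has a .groups argument that controls the grouping of its result. The column name share and the object names are made up for this example. Note the as.numeric call: pop is stored as an integer, and summing a whole continent can exceed R's integer range.

```r
library(dplyr)
library(gapminder)

# Grouped mutate: sum() is evaluated within each continent, not over all rows
pop_share <- gapminder %>%
  filter(year == 1982) %>%
  group_by(continent) %>%
  mutate(share = pop / sum(as.numeric(pop))) %>%
  ungroup()

# Grouped summarise: .groups = "drop" returns an ungrouped table
# and avoids the "has grouped output" message
continent_totals <- gapminder %>%
  group_by(continent, year) %>%
  summarise(population = sum(as.numeric(pop)), .groups = "drop")
```

In pop_share, the share column adds up to 1 within each continent; continent_totals has one row per continent and year.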

Join Two Tables Together

  • If you’ve used SQL this should be familiar
    • Mutating joins (add columns from the other table) are inner_join, left_join, right_join, full_join
    • Filtering joins (keep or drop rows without adding columns) are semi_join and anti_join
  • Use a common field to link data from two tables together in interesting ways
  • The dplyr documentation has a tutorial on joins

  • Practical example: let’s use the World Bank Population dataset to study European countries
data("world_bank_pop")
head(world_bank_pop, 3)
## # A tibble: 3 × 20
##   country indicator     `2000`   `2001`   `2002` `2003`  `2004`  `2005`   `2006`
##   <chr>   <chr>          <dbl>    <dbl>    <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 ABW     SP.URB.TOTL 42444    43048    43670    4.42e4 4.47e+4 4.49e+4  4.49e+4
## 2 ABW     SP.URB.GROW     1.18     1.41     1.43 1.31e0 9.51e-1 4.91e-1 -1.78e-2
## 3 ABW     SP.POP.TOTL 90853    92898    94992    9.70e4 9.87e+4 1.00e+5  1.01e+5
## # … with 11 more variables: 2007 <dbl>, 2008 <dbl>, 2009 <dbl>, 2010 <dbl>,
## #   2011 <dbl>, 2012 <dbl>, 2013 <dbl>, 2014 <dbl>, 2015 <dbl>, 2016 <dbl>,
## #   2017 <dbl>
  • Problem: there is no continent field in this data

Solution

  • Our gapminder data has a continent field, BUT it uses the full country name, not abbreviations!
  • So we need an intermediate table, country_codes, which ships with the gapminder package
data("country_codes")
head(country_codes, 3)
## # A tibble: 3 × 3
##   country     iso_alpha iso_num
##   <chr>       <chr>       <int>
## 1 Afghanistan AFG             4
## 2 Albania     ALB             8
## 3 Algeria     DZA            12

  • Now we can filter the gapminder dataset first, then use two inner joins to reduce the World Bank dataset to European countries
  • Joins are very similar to filter if you think about it
    • gapminder was reduced to a smaller dataset and then subsequent joins reduced the size of country_codes and then world_bank_pop
gapminder %>%
  filter(continent == "Europe") %>%
  distinct(country) %>%
  inner_join(country_codes, "country") %>%
  inner_join(world_bank_pop, c("iso_alpha" = "country")) %>%
  sample_n(3)
## # A tibble: 3 × 22
##   country iso_alpha iso_num indicator  `2000` `2001` `2002` `2003` `2004` `2005`
##   <chr>   <chr>       <int> <chr>       <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Romania ROU           642 SP.POP.GR… -0.129 -1.40   -1.83 -0.721 -0.570 -0.618
## 2 Albania ALB             8 SP.URB.GR…  0.742  0.710   2.18  2.06   1.97   1.83 
## 3 Romania ROU           642 SP.URB.GR… -0.420 -1.68   -1.97 -0.471 -0.323 -0.371
## # … with 12 more variables: 2006 <dbl>, 2007 <dbl>, 2008 <dbl>, 2009 <dbl>,
## #   2010 <dbl>, 2011 <dbl>, 2012 <dbl>, 2013 <dbl>, 2014 <dbl>, 2015 <dbl>,
## #   2016 <dbl>, 2017 <dbl>
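
To see how the join types differ, here is a sketch on two tiny made-up tables. The codes "XXX" and "YYY" and the value column are hypothetical and exist only so that some rows have no match.

```r
library(dplyr)
library(tibble)

left  <- tibble(code = c("FRA", "DEU", "XXX"),
                country = c("France", "Germany", "Nowhere"))
right <- tibble(code = c("FRA", "DEU", "YYY"),
                value = c(1, 2, 3))

inner <- inner_join(left, right, by = "code")  # only codes present in both
lefty <- left_join(left, right, by = "code")   # all rows of left, NA where no match
semi  <- semi_join(left, right, by = "code")   # rows of left WITH a match, no new columns
anti  <- anti_join(left, right, by = "code")   # rows of left WITHOUT a match
```

The inner and semi joins both keep the two matching rows, but only the inner join adds the value column. anti_join is a handy diagnostic for finding rows that would be silently dropped, such as countries whose names differ between datasets.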

Pivoting Data

  • tidyverse recently made the change from gather/spread to pivot_longer/pivot_wider
  • Powerful and flexible way to reshape data sets
  • Question: is this a tidy data set? How would you want this to look for plotting?
world_bank_pop %>% select(1:10) %>% head(5)
## # A tibble: 5 × 10
##   country indicator       `2000`  `2001`  `2002` `2003`  `2004`  `2005`   `2006`
##   <chr>   <chr>            <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 ABW     SP.URB.TOTL   42444     4.30e4  4.37e4 4.42e4 4.47e+4 4.49e+4  4.49e+4
## 2 ABW     SP.URB.GROW       1.18  1.41e0  1.43e0 1.31e0 9.51e-1 4.91e-1 -1.78e-2
## 3 ABW     SP.POP.TOTL   90853     9.29e4  9.50e4 9.70e4 9.87e+4 1.00e+5  1.01e+5
## 4 ABW     SP.POP.GROW       2.06  2.23e0  2.23e0 2.11e0 1.76e+0 1.30e+0  7.98e-1
## 5 AFG     SP.URB.TOTL 4436299     4.65e6  4.89e6 5.16e6 5.43e+6 5.69e+6  5.93e+6
## # … with 1 more variable: 2007 <dbl>

  • Is this a tidy data set? How would you plot this data?
    • Programming note: the matches function is a tidy select operator that matches columns via a regular expression. What is a regular expression? See Chapter 14 of R for Data Science.
world_bank_pop %>% pivot_longer(matches("20\\d\\d"), names_to = "year")
## # A tibble: 19,008 × 4
##    country indicator   year  value
##    <chr>   <chr>       <chr> <dbl>
##  1 ABW     SP.URB.TOTL 2000  42444
##  2 ABW     SP.URB.TOTL 2001  43048
##  3 ABW     SP.URB.TOTL 2002  43670
##  4 ABW     SP.URB.TOTL 2003  44246
##  5 ABW     SP.URB.TOTL 2004  44669
##  6 ABW     SP.URB.TOTL 2005  44889
##  7 ABW     SP.URB.TOTL 2006  44881
##  8 ABW     SP.URB.TOTL 2007  44686
##  9 ABW     SP.URB.TOTL 2008  44375
## 10 ABW     SP.URB.TOTL 2009  44052
## # … with 18,998 more rows
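
Notice that year comes out as a character column. One way to handle that, sketched on a hand-typed four-value excerpt of the Aruba rows above, is the names_transform argument of pivot_longer (this assumes a tidyr version that has it). The names wb_toy, wb_long, and wb_tidy are mine.

```r
library(dplyr)
library(tidyr)
library(tibble)

wb_toy <- tribble(
  ~country, ~indicator,    ~`2000`, ~`2001`,
  "ABW",    "SP.URB.TOTL",   42444,   43048,
  "ABW",    "SP.POP.TOTL",   90853,   92898
)

wb_long <- wb_toy %>%
  pivot_longer(
    matches("20\\d\\d"),
    names_to = "year",
    names_transform = list(year = as.integer)  # convert "2000" to 2000L
  )

# Optionally spread the indicators into their own columns
wb_tidy <- wb_long %>%
  pivot_wider(names_from = indicator, values_from = value)
```

wb_long has one row per country, indicator, and year; wb_tidy has one row per country and year with a column per indicator. Which of the two is "tidy" depends on what you plan to map to which aesthetic.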

Hadley’s Vision of Tidy Data

So, Longer not Wider?

  • Long format is generally the better default, but there is a tradeoff
    • For example, if you want to make a pretty table, you may want a wider format
gapminder %>%
  filter(year %>% between(1990, 2010)) %>%
  select(-continent) %>%
  filter(country == "Afghanistan") %>%
  select(-country) %>%
  pivot_longer(c(pop, gdpPercap, lifeExp)) %>%
  pivot_wider(names_from = year, values_from = value) %>%
  kbl(digits = 2) %>%
  kable_material()
name              1992         1997         2002         2007
pop        16317921.00  22227415.00  25268405.00  31889923.00
gdpPercap       649.34       635.34       726.73       974.58
lifeExp          41.67        41.76        42.13        43.83

My Vision of Tidy Data

  • Hard and fast rules like “variables should be columns and observations should be rows” distract from the end goal of tidying our data
    • Which columns will you map to which aesthetics?
  • Let’s illustrate by way of an example on the world_bank_pop dataset
    • Say we want to investigate urban population growth among European countries.

  • First, we need to use gapminder to create a filter on continent
  • The “SP.URB.GROW” is the urban population growth indicator, which is what we care about
urban_euros <- gapminder %>%
  filter(continent == "Europe") %>%
  distinct(country) %>%
  inner_join(country_codes, "country") %>%
  inner_join(world_bank_pop, c("iso_alpha" = "country")) %>%
  filter(indicator == "SP.URB.GROW")

urban_euros %>% 
  sample_n(2) %>%
  kbl(digits = 2) %>% 
  kable_material()
country  iso_alpha  iso_num  indicator     2000   2001   2002   2003   2004   2005   2006   2007   2008   2009   2010   2011   2012   2013   2014   2015   2016   2017
Germany  DEU        276      SP.URB.GROW   0.41   0.44   0.44   0.32   0.25   0.21   0.15   0.13   0.07   0.00   0.10  -1.60   0.20   0.29   0.43   0.88   0.84   0.47
Croatia  HRV        191      SP.URB.GROW  -2.42   0.72   0.31   0.31   0.29   0.38   0.26   0.22   0.27   0.18   0.05  -2.85   0.03   0.08  -0.02  -0.41  -0.26  -0.71

What columns do you have?

  • You have the name of the country, and then a column for each year
  • If you try and map columns to parts of your graph, you can do interesting things
urban_euros %>%
  mutate(country = fct_reorder(country, `2000`)) %>%
  ggplot(aes(x = `2000`, y = country, fill = `2000`)) +
    geom_bar(stat="identity", color = "dark grey") +
    guides(fill = "none") +
    scale_fill_gradient2(
      low = "#FA2721",
      mid = "white",
      high = "#3607DE"
    ) +
    theme_fivethirtyeight() +
    labs(title="Urban Population Growth in 2000") +
    theme(plot.title = element_text(size = 16))

Reordered the countries to make the graph more appealing


Why isn’t there a year column?

  • In the end you’re fundamentally limited by the lack of a year column
urban_euros %>%
  pivot_longer(matches("20\\d\\d"), names_to = "year") %>%
  sample_n(3) %>%
  kbl(digits=2) %>%
  kable_material()
country  iso_alpha  iso_num  indicator    year  value
Spain    ESP        724      SP.URB.GROW  2014   0.00
Belgium  BEL        56       SP.URB.GROW  2003   0.48
Italy    ITA        380      SP.URB.GROW  2015   0.33

This is why we tidy

  • Tidying data is for the end goal of plotting
  • Now we have a year column that we can map to an aesthetic.
    • I’ll randomly draw three countries and do a series plot
plot_countries <- urban_euros %>% distinct(country) %>% sample_n(3)
urban_euros %>%
  inner_join(plot_countries, by = "country") %>%
  pivot_longer(matches("20\\d\\d"), names_to = "year") %>%
  mutate(across(year, as.numeric)) %>%
  ggplot(aes(year, value, fill = country, group = country, color = country)) +
    geom_hline(yintercept = 0, color = "black") +
    geom_line(size=2, key_glyph = "timeseries") +
    geom_point(pch = 21, color = "black", size = 4) +
    theme_fivethirtyeight() +
    labs(title = "Urban Population Growth, Select Countries")

This is the obvious thing to do, imo.


We don’t lose what we did before

  • We could have made the same graph as we did earlier
    • All we need to do is filter the long dataset on year == 2000 and replace the 2000 column with value
urban_euros %>%
  pivot_longer(matches("20\\d\\d"), names_to = "year") %>%
  mutate(across(year, as.numeric)) %>%
  filter(year == 2000) %>%
  mutate(country = fct_reorder(country, value)) %>%
  
  ggplot(aes(x = value, y = country, fill = value)) +
    geom_bar(stat="identity", color = "dark grey") +
    guides(fill = "none") +
    scale_fill_gradient2(
      low = "#FA2721",
      mid = "white",
      high = "#3607DE"
    ) +
    theme_fivethirtyeight() +
    labs(title="Urban Population Growth in 2000") +
    theme(plot.title = element_text(size = 16))

We don’t lose what we did before

  • In fact you gain a dimension you can add to that plot—time!
plot <- urban_euros %>%
  pivot_longer(matches("20\\d\\d"), names_to = "year") %>%
  mutate(across(year, as.numeric)) %>%
  ggplot(aes(x = value, y = country, fill = value)) +
    geom_bar(stat="identity", color = "dark grey") +
    transition_time(year) +
    guides(fill = "none") +
    scale_fill_gradient2(
      low = "#FA2721",
      mid = "white",
      high = "#3607DE"
    ) +
    theme_fivethirtyeight() +
    labs(title="Urban Population Growth in {round(frame_time)}") +
    theme(plot.title = element_text(size = 16))

anim_save("writing/rdemo_assets/animated-2.gif", plot)

Recap

  • Our artistry and “aesthetic” come through in discovering ways to map columns of our data to aspects of a graph
  • Data transformation and tidying is the toolkit that enables visualization
  • Functional Programming is the mindset that epitomizes data transformation and tidying
  • You’ll notice that all of the data manipulation I do in my scripts is via dplyr and tidyr verbs. This should be your goal as well!

Assignment

  • At this point you should be able to do your assignment
    • Is there any clarification that would be helpful at this point?
    • Don’t hesitate to message me on Discord or email me at nhattersley@utexas.edu

Next time

  • Random helpful tidyverse packages
  • Attendance is optional, but if you’re interested in R it’s highly encouraged